We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and adapting the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: CL considers only image-text alignment during training, whereas the segmentation task requires region-text alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that addresses this discrepancy by directly aligning a text with the region it describes. Our method generates a segmentation mask associated with a given text, extracts a grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment instead of image-text alignment, the framework encourages the model to directly improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol covering eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
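The three steps named in the abstract (generate a mask for a text, pool a grounded image embedding from the masked region, align it with the text embedding) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the sigmoid mask, the `scale` and `temperature` values, and the plain-list embeddings are all assumptions chosen to keep the example self-contained.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def soft_mask(patches, text, scale=10.0):
    """Hypothetical mask: sigmoid of the scaled patch-text cosine similarity."""
    t = normalize(text)
    return [1.0 / (1.0 + math.exp(-scale * dot(normalize(p), t))) for p in patches]

def grounded_embedding(patches, mask):
    """Mask-weighted average of patch embeddings (the 'masked region' feature)."""
    total = sum(mask)
    dim = len(patches[0])
    return [sum(m * p[d] for m, p in zip(mask, patches)) / total for d in range(dim)]

def info_nce(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE loss; pair i is the positive for row i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    loss = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, t) / temperature for t in txts]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(imgs)

# Two toy "images" of three 2-D patch embeddings each, with matching texts.
patches_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
patches_b = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
text_a, text_b = [1.0, 0.0], [0.0, 1.0]

emb_a = grounded_embedding(patches_a, soft_mask(patches_a, text_a))
emb_b = grounded_embedding(patches_b, soft_mask(patches_b, text_b))
loss = info_nce([emb_a, emb_b], [text_a, text_b])
```

In this sketch the loss depends on the mask, so lowering the loss pushes the mask toward patches that match the text, which is the intuition behind region-text alignment.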
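Domain generalization (DG) aims to learn a model that generalizes to unseen domains using only a limited set of source domains. Previous DG approaches fail to learn domain-invariant representations from the source domains alone because of the significant domain shift between training and test domains. Instead, we re-formulate the DG objective using mutual information with an oracle model, a model that generalizes to any possible domain. We derive a tractable variational lower bound by approximating the oracle model with a pre-trained model; we call the resulting method Mutual Information Regularization with Oracle (MIRO). Our extensive experiments show that MIRO significantly improves out-of-distribution performance. Furthermore, our scaling experiments show that the larger the pre-trained model, the greater the performance improvement from MIRO. Source code is available at https://github.com/kakaobrain/miro.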
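Recently, self-supervised methods have shown remarkable achievements in image-level representation learning. Nevertheless, their image-level self-supervision leads to learned representations that are sub-optimal for dense prediction tasks such as object detection and instance segmentation. To tackle this issue, several recent self-supervised learning methods have extended the single image-level embedding to pixel-level dense embeddings. Unlike in image-level representation learning, pixel-level positive pairs are difficult to sample because of the spatial deformation introduced by augmentation. Previous studies sample pixel-level positive pairs either by winner-takes-all selection over the similarity between dense embeddings or by thresholding the distance between them. However, these naive methods struggle with background clutter and outliers. In this paper, we introduce Hough Contrastive Learning (HoughCL), a Hough-space-based method that enforces geometric consistency between two dense features. HoughCL is robust to background clutter and outliers. Furthermore, compared to the baseline, our dense positive pairing method has no additional learnable parameters and only a small additional computational cost. Compared to previous work, our method shows better or comparable performance on dense prediction fine-tuning tasks.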
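Domain generalization (DG) methods aim to generalize to unseen target domains using only training data from source domains. Although various DG methods have been proposed, a recent study shows that under a fair evaluation protocol, called DomainBed, the simple empirical risk minimization (ERM) approach performs comparably to previous methods. Unfortunately, simply solving ERM on a complex, non-convex loss function can easily lead to sub-optimal generalizability by converging to sharp minima. In this paper, we show theoretically that finding flat minima results in a smaller domain generalization gap. We also propose a simple yet effective method, named Stochastic Weight Averaging Densely (SWAD), to find flat minima. SWAD finds flatter minima and suffers less from overfitting thanks to a dense and overfit-aware stochastic weight sampling strategy. SWAD achieves state-of-the-art performance on five DG benchmarks, namely PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, with a consistent and large margin of +1.6% in average out-of-domain accuracy. We also compare SWAD with conventional generalization methods, such as data augmentation and consistency regularization, to verify that the remarkable performance improvement comes from seeking flat minima rather than from better in-domain generalizability. Last but not least, SWAD is readily applicable to existing DG methods without modification; combining SWAD with an existing DG method further improves DG performance. Source code is available at https://github.com/khanrc/swad.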
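The dense weight averaging idea can be sketched as a running mean over the weights of every iteration inside an averaging window. This is only an illustration: the class and function names are hypothetical, and the `start` and `end` indices are passed in by hand here, whereas the abstract describes an overfit-aware strategy for choosing which iterations to sample.

```python
class RunningWeightAverage:
    """Incremental mean of weight vectors, updated once per sampled iteration."""

    def __init__(self):
        self.mean = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # mean_k = mean_{k-1} + (w_k - mean_{k-1}) / k
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


def average_dense(weight_history, start, end):
    """Average the weights of every iteration in [start, end] (inclusive).

    `weight_history[i]` is the flattened weight vector after iteration i.
    The window bounds are assumptions supplied by the caller.
    """
    avg = RunningWeightAverage()
    for step, weights in enumerate(weight_history):
        if start <= step <= end:
            avg.update(weights)
    return avg.mean


# Toy history: early and late iterates are excluded from the window.
history = [[0.0, 10.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
averaged = average_dense(history, start=1, end=3)
```

Sampling every iteration, rather than once per epoch, is what makes the averaging "dense"; the incremental update avoids storing the whole history in a real training loop.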
Harmonic functions are abundant in nature, appearing in limiting cases of Maxwell's equations, the Navier-Stokes equations, and the heat and wave equations. Consequently, harmonic functions have many applications, ranging from industrial process optimisation to robotic path planning and the calculation of first exit times of random walks. Despite their ubiquity and relevance, there have been few attempts to develop effective means of representing harmonic functions in machine learning architectures, either on classical computers or in the nascent field of quantum machine learning. Architectures which impose or encourage an inductive bias towards harmonic functions would facilitate data-driven modelling and the solution of inverse problems in a range of applications. For classical neural networks, it is already well established that leveraging inductive biases can, in general, improve the performance of learning algorithms. The introduction of such inductive biases in a quantum machine learning setting, by contrast, is still at an early stage. In this work, we derive exactly-harmonic (conventional and quantum) neural networks in two dimensions for simply-connected domains by leveraging the characteristics of holomorphic complex functions. We then demonstrate how these can be approximately extended to multiply-connected two-dimensional domains using techniques inspired by domain decomposition in physics-informed neural networks. We further provide architectures and training protocols to effectively impose approximately harmonic constraints in three dimensions and higher, and as a corollary we report divergence-free network architectures in arbitrary dimensions. Our approaches are demonstrated with applications to heat transfer, electrostatics and robot navigation, and are compared with physics-informed neural networks.
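The two-dimensional construction rests on a standard fact: the real part of a holomorphic function is harmonic. The sketch below checks this numerically with a five-point finite-difference Laplacian. It is not the paper's network architecture; the example function `f`, the helper names, and the step size `h` are assumptions made for illustration.

```python
import cmath

def harmonic_from_holomorphic(f):
    """Return u(x, y) = Re f(x + iy), which is harmonic when f is holomorphic."""
    return lambda x, y: f(complex(x, y)).real

def laplacian(u, x, y, h=1e-3):
    """Five-point finite-difference approximation of u_xx + u_yy."""
    return (u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h)
            - 4.0 * u(x, y)) / (h * h)

# A holomorphic example function (polynomial plus exponential).
f = lambda z: z ** 3 + cmath.exp(z)
u = harmonic_from_holomorphic(f)

# A non-harmonic function for contrast: its Laplacian is 4 everywhere.
v = lambda x, y: x * x + y * y

lap_u = laplacian(u, 0.3, -0.7)   # close to 0, up to discretisation error
lap_v = laplacian(v, 0.3, -0.7)   # close to 4
```

Because the property holds for any holomorphic `f`, a model that parameterises `f` and outputs its real part is harmonic by construction rather than through a penalty term.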
We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark (DUE).
We present HOReeNet, which tackles the novel task of manipulating images involving hands, objects, and their interactions. In particular, we are interested in transferring objects from source images to target images and manipulating 3D hand postures so that they tightly grasp the transferred objects. Furthermore, the manipulation needs to be reflected in the 2D image space. In our reenactment scenario involving hand-object interactions, 3D reconstruction becomes essential, as 3D contact reasoning between hands and objects is required to achieve a tight grasp. At the same time, obtaining high-quality 2D images from 3D space requires well-designed 3D-to-2D projection and image refinement. HOReeNet is the first fully differentiable framework proposed for this task. On hand-object interaction datasets, we compare HOReeNet with conventional image translation and reenactment algorithms, and demonstrate that our approach achieves state-of-the-art performance on the proposed task.
Pretrained Language Models (LMs) memorize a vast amount of knowledge during initial pretraining, including information that may violate the privacy of personal lives and identities. Previous work addressing privacy issues for language models has mostly focused on data preprocessing and differential privacy methods, both of which require re-training the underlying LM. We propose knowledge unlearning as an alternative method to reduce privacy risks for LMs post hoc. We show that simply performing gradient ascent on target token sequences is effective at forgetting them with little to no degradation of general language modeling performance for larger LMs; it sometimes even substantially improves the underlying LM with just a few iterations. We also find that sequential unlearning is better than trying to unlearn all the data at once, and that unlearning is highly dependent on which kind of data (domain) is forgotten. Through comparisons with a previous data preprocessing method and a decoding method known to mitigate privacy risks for LMs, we show that unlearning can give a stronger empirical privacy guarantee in scenarios where the data vulnerable to extraction attacks are known a priori, while being much more efficient and robust. We release the code and dataset needed to replicate our results at https://github.com/joeljang/knowledge-unlearning.
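Gradient ascent on target token sequences means updating parameters in the direction that increases, rather than decreases, the negative log-likelihood of those tokens. The toy sketch below uses a context-free softmax over a four-token vocabulary as a stand-in for a language model; the model, the function names, and the learning rate are assumptions for illustration, not the released code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def nll(logits, tokens):
    """Mean negative log-likelihood of the target tokens."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in tokens) / len(tokens)

def nll_gradient(logits, tokens):
    """Gradient of the mean NLL w.r.t. the logits: probs - empirical frequency."""
    grad = softmax(logits)
    for t in tokens:
        grad[t] -= 1.0 / len(tokens)
    return grad

def unlearn_step(logits, tokens, lr=0.5):
    """One gradient *ascent* step: note the plus sign, the reverse of training."""
    grad = nll_gradient(logits, tokens)
    return [l + lr * g for l, g in zip(logits, grad)]

# Toy model that has "memorized" token 0; the target sequence is to be forgotten.
logits = [2.0, 0.0, 0.0, 0.0]
target = [0, 0, 1]

before = nll(logits, target)
for _ in range(5):
    logits = unlearn_step(logits, target)
after = nll(logits, target)   # larger than `before`: the sequence is less likely
```

In a real LM the same sign flip is applied to the usual training loss on the sequences to be forgotten, while general performance has to be monitored separately.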
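Salient object detection (SOD) has attracted attention recently, yet it has been studied less for high-resolution (HR) images. Unfortunately, HR images and their pixel-level annotations are considerably more labor-intensive and time-consuming to obtain than low-resolution (LR) images and annotations. We therefore propose an image-pyramid-based SOD framework, the Inverse Saliency Pyramid Reconstruction Network (InSPyReNet), for HR prediction without any HR datasets. We design InSPyReNet to produce a strict image pyramid structure of the saliency map, which allows multiple results to be ensembled with pyramid-based image blending. For HR prediction, we design a pyramid blending method that synthesizes two different image pyramids from a pair of LR and HR scales of the same image to overcome the effective receptive field (ERF) discrepancy. Our extensive evaluation on public LR and HR SOD benchmarks demonstrates that InSPyReNet surpasses state-of-the-art (SotA) methods on various SOD metrics and in boundary accuracy.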
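Recent advances in machine learning have shown that pre-trained representations acquired via self-supervised learning can achieve high accuracy with small amounts of training data. Unlike in the vision and natural language processing domains, pre-training for IMU-based applications is challenging, because only a few publicly available datasets have sufficient size and diversity to learn generalizable representations. To overcome this problem, we propose IMG2IMU, a novel approach that adapts representations pre-trained on large-scale images to diverse few-shot IMU sensing tasks. We convert the sensor data into visually interpretable spectrograms so that the model can exploit knowledge gained from vision. In addition, we apply contrastive learning with augmentations that we designed to learn representations suited to interpreting sensor data. Our extensive evaluation on five IMU sensing tasks shows that IMG2IMU consistently outperforms the baselines, illustrating that vision knowledge can be incorporated into few-shot learning settings for IMU sensing tasks.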
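Converting a sensor channel into a spectrogram can be sketched as a short-time Fourier transform: slide a window over the signal, taper each frame, and keep the magnitude of each frequency bin. The window length, hop size, Hann taper, and function name below are assumptions for illustration; the abstract does not state the parameters actually used.

```python
import cmath
import math

def spectrogram(signal, window=32, hop=16):
    """Magnitude spectrogram of a 1-D signal via a direct DFT per frame.

    Returns a list of frames; each frame holds window // 2 + 1 bin magnitudes.
    """
    # Hann taper to reduce spectral leakage at the frame boundaries.
    taper = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (window - 1)))
             for n in range(window)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        frame = [x * w for x, w in zip(signal[start:start + window], taper)]
        mags = []
        for k in range(window // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / window)
                        for n, x in enumerate(frame))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames

# Toy accelerometer-like channel: a sinusoid with 4 cycles per 32 samples.
signal = [math.sin(2.0 * math.pi * 4.0 * n / 32.0) for n in range(128)]
spec = spectrogram(signal)

# Each frame's strongest bin is the one matching the sinusoid's frequency.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

The resulting time-by-frequency grid can be treated as a single-channel image, which is what lets an image-pretrained model consume it; a production pipeline would typically use an FFT routine rather than this direct DFT.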